Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences

Schneider, Johannes

arXiv.org Artificial Intelligence

Pre-trained language models have led to a new state-of-the-art in many NLP tasks. However, for topic modeling, statistical generative models such as LDA are still prevalent. They do not easily allow incorporating contextual word vectors and might yield topics that do not align well with human judgment. In this work, we propose a novel topic modeling and inference algorithm. We suggest a bag-of-sentences (BoS) approach that uses sentences as the unit of analysis. We leverage pre-trained sentence embeddings by combining generative process models with clustering. We derive a fast inference algorithm based on expectation maximization, hard assignments, and an annealing process. Our evaluation shows that our method yields state-of-the-art results with relatively little computational demand. Our method is also more flexible than prior works leveraging word embeddings, since it provides the possibility to customize topic-document distributions using priors. Code is at \url{https://github.com/JohnTailor/BertSenClu}.
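The inference procedure the abstract describes — expectation maximization with hard(ening) assignments over sentence embeddings, driven by an annealing schedule — can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: the random vectors stand in for pre-trained sentence embeddings, and the dimension, topic count, and cooling schedule are arbitrary choices.

```python
import math
import random

random.seed(0)

def cos(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return num / (na * nb)

# Toy stand-ins for pre-trained sentence embeddings (hypothetical data).
DIM, K = 8, 3
sentences = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(40)]

# Initialize topic vectors from random sentences.
topics = [list(s) for s in random.sample(sentences, K)]

for step in range(20):
    # Annealing: temperature decays, so soft assignments harden over time.
    T = max(0.05, 0.7 ** step)

    # E-step: temperature-scaled softmax responsibilities per sentence.
    assign = []
    for s in sentences:
        sims = [cos(s, t) / T for t in topics]
        m = max(sims)
        ws = [math.exp(x - m) for x in sims]
        z = sum(ws)
        assign.append([w / z for w in ws])

    # M-step: each topic vector becomes the weighted mean of its sentences.
    for k in range(K):
        tot = sum(a[k] for a in assign) or 1e-12
        topics[k] = [sum(a[k] * s[d] for a, s in zip(assign, sentences)) / tot
                     for d in range(DIM)]

# Final hard assignment: each sentence in the "bag of sentences" gets a topic.
labels = [max(range(K), key=lambda k: a[k]) for a in assign]
print(len(labels))
```

In the actual method the embeddings would come from a pre-trained sentence encoder, and the topic-document distribution would additionally be shaped by priors; this sketch only shows the hard-EM-with-annealing skeleton.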


Analytical Formulations for the Level Based Weighted Average Value of Discrete Trapezoidal Fuzzy Numbers

Nasiboglu, Resmiye, Abdullayeva, Rahila

arXiv.org Artificial Intelligence

In fuzzy decision-making processes based on linguistic information, operations on discrete fuzzy numbers are commonly performed. Aggregation and defuzzification are among the most frequently used of these operations. Many aggregation and defuzzification operators produce results independent of the decision maker's strategy. The Weighted Average Based on Levels (WABL) approach, on the other hand, can take into account both the level weights and the decision maker's optimism strategy. This gives the WABL operator flexibility: through machine learning, it can be trained toward the decision maker's strategy, producing more satisfactory results for the decision maker. However, determining the WABL value requires calculating certain integrals. In this study, the WABL concept for discrete trapezoidal fuzzy numbers is investigated, and analytical formulas are proven that facilitate the calculation of the WABL value for these fuzzy numbers. Trapezoidal fuzzy numbers, and their special case, triangular fuzzy numbers, are the most commonly used fuzzy number types in fuzzy modeling, which is why such numbers are studied here. Computational examples illustrating the theoretical results are provided.
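To make the level-based idea concrete, here is a hedged sketch for the (continuous) trapezoidal case, assuming the common WABL formulation: WABL(A) = ∫₀¹ [β·R(α) + (1−β)·L(α)]·p(α) dα, where a trapezoidal number (a, b, c, d) has α-cut endpoints L(α) = a + (b−a)α and R(α) = d − (d−c)α, β is the optimism degree, and p(α) = (k+1)·αᵏ is a normalized level-weight function. The names, parameterization, and weight family are illustrative assumptions, not the paper's notation; the closed form is checked against brute-force integration.

```python
def wabl_trapezoid(a, b, c, d, beta=0.5, k=1):
    """Closed-form WABL of trapezoid (a, b, c, d) under p(alpha) = (k+1)*alpha^k.

    Assumed formulation (see lead-in); uses the moment
    integral_0^1 alpha * p(alpha) d alpha = (k+1)/(k+2).
    """
    m = (k + 1) / (k + 2)
    left = a + (b - a) * m    # level-weighted left endpoint
    right = d - (d - c) * m   # level-weighted right endpoint
    return beta * right + (1 - beta) * left

def wabl_numeric(a, b, c, d, beta=0.5, k=1, n=100_000):
    """Riemann-sum check of the closed form."""
    total = 0.0
    for i in range(n):
        al = (i + 0.5) / n
        L = a + (b - a) * al
        R = d - (d - c) * al
        total += (beta * R + (1 - beta) * L) * (k + 1) * al ** k / n
    return total

print(round(wabl_trapezoid(1, 2, 4, 6), 4))   # closed form
print(round(wabl_numeric(1, 2, 4, 6), 4))     # numerical check
```

The point of the paper's analytical formulas is exactly to replace the integral (the `wabl_numeric` path, adapted to the discrete setting) with a direct expression like `wabl_trapezoid`.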


A tale about LDA2vec: when LDA meets word2vec

#artificialintelligence

A few days ago I discovered lda2vec (by Chris Moody) – a hybrid algorithm combining the best ideas from the well-known LDA (Latent Dirichlet Allocation) topic modeling algorithm and from a somewhat less well-known language modeling tool named word2vec. And now I'm going to tell you a tale about lda2vec and my attempts to try it and compare it with a plain LDA implementation (I used the gensim package for this). What is cool about it? LDA creates document (and topic) representations that are not very flexible but are mostly interpretable to humans. Also, LDA treats each document as a bag of words, whereas word2vec works with a set of documents as if it were one very long text string.


Gibbs Max-margin Topic Models with Data Augmentation

Zhu, Jun, Chen, Ning, Perkins, Hugh, Zhang, Bo

arXiv.org Machine Learning

Max-margin learning is a powerful approach to building classifiers and structured output predictors. Recent work on max-margin supervised topic models has successfully integrated it with Bayesian topic models to discover discriminative latent semantic structures and make accurate predictions for unseen testing data. However, the resulting learning problems are usually hard to solve because of the non-smoothness of the margin loss. Existing approaches to building max-margin supervised topic models rely on an iterative procedure to solve multiple latent SVM subproblems with additional mean-field assumptions on the desired posterior distributions. This paper presents an alternative approach by defining a new max-margin loss. Namely, we present Gibbs max-margin supervised topic models, a latent variable Gibbs classifier that discovers hidden topic representations for various tasks, including classification, regression and multi-task learning. Gibbs max-margin supervised topic models minimize an expected margin loss, which is an upper bound of the existing margin loss derived from an expected prediction rule. By introducing augmented variables and integrating out the Dirichlet variables analytically by conjugacy, we develop simple Gibbs sampling algorithms with no restricting assumptions and no need to solve SVM subproblems. Furthermore, each step of the "augment-and-collapse" Gibbs sampling algorithms has an analytical conditional distribution, from which samples can be easily drawn. Experimental results demonstrate significant improvements in time efficiency. The classification performance is also significantly improved over competitors on binary, multi-class and multi-label classification tasks.
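The "augment-and-collapse" step rests on a scale-mixture identity for the non-smooth hinge term (in the style of Polson and Scott's data augmentation for SVMs). Writing $u$ for the (negated, scaled) margin quantity and $\lambda > 0$ for the augmentation variable — symbols chosen here for illustration, not the paper's notation — the identity is:

```latex
e^{-2\max(0,\,u)} \;=\; \int_0^\infty \frac{1}{\sqrt{2\pi\lambda}}
\exp\!\left(-\frac{(\lambda+u)^2}{2\lambda}\right) d\lambda,
```

which holds because $\min_{\lambda>0} \frac{(\lambda+u)^2}{2\lambda} = 2\max(0,u)$ and the Gaussian scale mixture integrates exactly. Conditioned on $\lambda$, the margin loss contributes a term that is quadratic in the discriminant, so each Gibbs conditional — for the topic assignments, the classifier weights, and $\lambda$ itself — becomes analytical, with no SVM subproblem to solve.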